Unicorns are real. They are called Rhinoceros.
Chipseq data analysis and peak calling.
Taken from Systems Biology 2019, module on Epigenomics by Oleg Shpynov and Roman Chernyatchik.
Evolution does not work on blank slate. It works with the preexisting information and builds upon it. In context of epigenomics, we have regulatory DNA.
Genomics is the sum total of genes and genic product (Transcriptomics - different types of RNA). We do NGS, RNAseq, Variant, Copy Number Variation etc. analyses for these.
In Epigenomics we are concerned with the regulation of the genome. Here is where environment/ lifestyle can actually affect the individual.
Epigenomics also plays a key role in development e.g. different tissues of the body share although the same DNA but different epigenetic patterns that results in specificity of their function. Thus differentiation, de-differentiation, re-differentiation etc. can be seen as Epigentic changes. It is also noteworthy that these mechanisms vary greatly between plants and animals, an early evolutionary split.
Examples: 1.) the Dutch mass starvation during second world war resulted in epigenetic changes which can still be seen 6 generations after the incident. Their metabolism shifted to a conservative mode i.e. preserving energy from diet. Other exampl
2.) X barr inactivation in females. There are two copies of X chromosomes, one of these is inactivated during embryonic development and remains highly condensed.
3.) Mixed coat of Cats, red fur in these furry animals, mostly females. Sometimes male cats also show this condition, they have XXY condition and are sterile. Note: When it is XXXY male cat they are not sterile. An analogy could be bacterial Lac operon. Where if Lactose is present it will switch on the gene responsible for lactose breakdown.
https://en.wikipedia.org/wiki/DNA_methylation
In January Sys Biol module there was a lecture where segnor Oleg described their research about whether Epigenetic markers change in Humans. The answer was 'no', they were surprised by the result. Another lab in some other part of the world also came upon the same conclusion.
This is surprising because in Mice these markers change with age. Suggesting that in Humans this process of Epiregulation is complicated.
Packaging of DNA
If we look at the packaging:
DNA -wrapped around---> histone octamer -multiple histones form---> nucleosomes --multiple nucleosomes---> solenoids (cylindrical supra-assemblies) --forms----> chromatin --form---> chromatin loops (if they are too tight: hetro; else: euchromatine) ----further condense to form-----> chromosome
Epigentic regulation can occur at any of the stages given above.
Examples:
1.) For DNA, at individual nucleotide base level, we can have methylation (methyltransferases enzyme), hydroxy-methylation, sulphynation etc. This results in physical blockage of DNA structure, such that replication motor cannot move over the strand and transcription is not possible. This can be thought of as a similar process to protein activation/ inactivation by mechanism such as phosphorylation, only that DNA methylation is the boss of all.
2.) Histone level includes modifications of histone molecules which activates them to wrap tightly or loosely around the DNA. The resulting subsequent structures will also be tightly bound.
Mirny, L.A., Chromosome Res 19, pp 37–51 (2011)
Nucleosome is a equilibrium globule i.e. it allows for easier access of DNA relatively independent of regions. Many such Nucleosomes come together to form a fractal globule.
There are two types of chromatine: Heterochromatin and Euchromatine. 'Eu' means 'true' i.e. his region of chromatine is open for transcription as opposed to 'hetro' which is tightly bound. So the inactivated X (barr body) can be seen as heterochromosome as the entire chromatin on this chromosome is heterochromatine.
Thus the process of chromatin activation (conversion of hetero to Eu chromatin). Note: this is not the same as saying decondensation of chromosome, decondensation is the process of unpacking of chromatids into the original state after cell division [https://www.sciencedirect.com/science/article/pii/S0955067416300059]. Activation means that they are actually ready for transcription. REMEMBER!! Constitutive heterochromatin is irreversable and Facultative heterochromatin is reverible. This means that if heterochromatine is constituent of the organism it won't change to euchromaine, thus any gene which has been constitutively heterochromatisised is irreversibly silenced forever. Check out the image below.
At loop level, distant regions of DNA come together for transcription of certain genes. If this process is interrupted, say for example, by binding of a protein, we will see no transcription. Thus looping is also crucial not just for condensation but transcription of certain genes. Observe the picture below to understand.
To summarize: There are 2 major definations of epigenetics:
○ Heritable changes that do not involve DNA modification (still heavily debated) -- changes in metabolic patterns, onset of diseases
○ DNA modification and packaging changes that influence expression (well established) -- regulation of genic function
Also depending on the process:
○ differentiating cells -- switching on/off of genes to achieve functional roles
○ pluripotent cells -- completely open genome, for the sake of differentiating in any other kind of cell in the body
○ differentiated cells -- after achieving a particular role, changes in the metabolic patterns. An analogy can be
bacterial Lac, His operons etc.

Epigenetics Regulation via:
Histone Modifications
Chromatin accessibility and 3D structure
Non-Coding RNAs
DNA cytosine methylation
Look at this image, these are the modifications we are also going to analyse for our data:
Cut to the chase summary:
1.)
Narrow peaks of H3K4me3 mark promoters. Enzyme that methylates K4 binds only to non-CpG-methylated promoters (mouse heart)!
2.)
Enhancers are associated with H3K4me1 and H3K27ac. H3K27ac is thought to distinguish active enhancers from poised. The word poised means the gracefully static, just like the ballerina when she is en-point. Thus a poised gene is a gene that was suppressed but is now ready to be activated.
H3K27me3 marks suppressed genes poised to be activated.
Stem cells can have both H3K4me3 and H3K27me3 – unique!
In this image, same promoter region has both tri-methylation marks. Such region is a bivalent chromatin domain and are shown to exist also in embryonic stem cells. These poise gene for activation while also keeping them repressed!
3.)
Transcriptional elongation is marked my H3K36me3 and H3K79me2 (Prob cells, Mouse).
4.)
Marks H3K9me2 and H3K9me3 are strongly associated with heterochromatin.
Jacobs, S.A.; Khorasanizadeh, S., Science, 295(5562), pp 2080-3 (2002), Binding with protein HP1
When we think of methylation, we mostly think about CpG islands. The CpG (C preceding G) islands is a region with high frequency of CG's. 60% human genes has CGI in promoter. Methylation/ demethlyation of these islands can turn genic activity on/off. At transcription start site (TSS) we find minimum number of CG islands, even in unexpressed genes.
But of-course CpG islands or mCG (modification at CG) are not the only way methylation works. We have methylation at other bases called modification at CH where H is either A, T, or C.
The following image shows that mCH are most common in neurons. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4729449/]
Observe the following image:
Earlier, DNA methylation was thought to have only a 'silencing' effect on the genome. But methylome, as you can see has a variety of function based on the genetic context e.g. transcriptional start sites with/without CpG islands, in gene bodies, at regulatory elements, in repeat sequences all have a different function.
The end-result of this pipeline would help us understand this picture even better.
Wet lab
1.) Destroy the cells 2.) Add formaldehyde to chemically bind proteins to DNA 3.) Break the DNA into smaller fragments (using ultrasound -- sonification) 4.) Precipitate by binding to antibodies (immunoprecipitation) 5.) Elute (extract) the fragments 6.) Construct the library for PCR 7.) Sequencing of the fragments
Dry lab
7.5) Quality checking 8.) Alignment of reads to a reference genome using bowtie2 9.) Sorting and indexing of bams using samtools 9.5) Visualisation using JBR (JetBrains Research (correct me if I am wrong) -Genome Browser) 10.) Peak calling:
10.1.) Macs2 10.2.) Sicer 10.3.) SPAN
11.) Getting most confident peaks 12.) Annotaing the peaks in JBR 13.) Model training 14.) Model tuning
15.) Interpretation
The data is taken from following paper:
LTR retrotransposons transcribed in oocytes drive species-specific and heritable changes in DNA methylation [https://www.nature.com/articles/s41467-018-05841-x]
If you search for keyword 'GSE' on this paper you'll get Gene Expression Omnibus accession: GSE112622
From there you can get an SRP (Sequence Read Project) ID.
We are intrested in Chip-Seq samples, particularly: ● GSM3074494 ● GSM3074495
We can download GSM3074495 FASTQ Files using fastq dump.
But we already have the data, since the data is large only 15th chromosome is taken.
All the respective tracks are also downloaded in the data folder: ENCODE project https://www.encodeproject.org
Please look at the presentation for complete pipeline. Let's discuss the script line by line with results.
Open terminal and get tree. Let's look at the directory structure.
We need to modify the scripts according to this directory structure. We will only modify the main.sh file. Please extract this zip in you any directory.
sudo apt-get install tree
tree
In order for the script to work, the structure should look like this.
All the genomes will be in raw directory. The script will make symbolic links in data directory. And execute all commands to give the final data.
#########################################################################################
# DEPENDENCIES
# fastqc for individual file quality check
sudo apt-get install fastqc
# multiqc for combined quality check
pip install multiqc
# bowtie for alignment
sudo apt-get install bowtie
# deeptools for analysis of deep-seq data
conda install -c bioconda deeptools
# latest samtools
conda install -c bioconda samtools
# pybigwig: it is an extension for deeptools for quick accessing bigWig and bigBed files
conda install -c bioconda/label/cf201901 pybigwig
# deeptools_intervels: it is another python extension for deeptools for constructing interval trees with
associated exon/annotation information
conda install -c bioconda deeptoolsintervals
# pysam: Pysam is a python module for reading and manipulating Samfiles
conda install -c bioconda pysam
# Macs2 peak caller
conda install -c bioconda macs2
# Scicer peak caller
conda install -c bioconda scicer
#########################################################################################
#########################################################################################
$PWD =pwd
# bash: {current directory}: Is a directory
mkdir $PWD/main/data
ln -s $PWD/data/* $PWD/main/data/
bash $PWD/main/scripts/fastqc.sh $PWD/main/data
multiqc -f -o $PWD/main/data/fastqc $PWD/main/data/fastqc
bash $PWD/main/scripts/bowtie.sh hg19 $PWD/data/index/ 0 $PWD/main/data
#########################################################################################
The tree should look like this now.
#########################################################################################
multiqc -f -o $PWD/main/data/bams_qc $PWD/main/data/*_bowtie*.log
#########################################################################################
The multiqc report should be generated now.
Now we are going to use convert our data into bigWig format. BigWig is a cool format that allows for fast visualisation of the alignment data. But the trade off is that these files are quite big in size. On UCSC browser what we see is actually huge condensed data as a graph. For the sake of definition:
bigWig files are indexed binary files which contain alignment data to the genome. It is similar to bed file which contains location of the genome where the read aligned to. Because it is indexed, it can be used to quickly find the location you want.
We will use JBR browser to visualise our data.
#########################################################################################
bash $PWD/main/scripts/bigwig.sh $PWD/main/data
#########################################################################################
Our data is now ready for peak calling.
We will run peak calling on two settings, at 0.05 with NARROW mode (abscence of --broad parameter) and 0.1 which has BROAD mode on by specifying --broad parameter. Narrow mode has a stringent cutoff while broad mode has a relaxed cutoff,
Both have their pros and cons, for example, BROAD mode covers more peaks but NARROW mode will give less peak with good quality.
#########################################################################################
bash $PWD/main/scripts/macs2.sh $PWD/main/data hg19 "q0.05" "-q 0.05"
ls $PWD/main/data/*macs*.log
bash $PWD/main/scripts/macs2.sh $PWD/main/data hg19 "broad_0.1" "--broad --broad-cutoff 0.1"
#########################################################################################
Let's look at the executed command.
Now is a good time to organise the folders for further analysis. Because we are going to do some mental gymnastics.
Peak calling programs help to define sites of Protein:DNA binding by identifying regions where sequence reads are enriched in the genome after mapping. The common assumption is that the ChIP-seq process is relatively unbiased so reads should accumulate at sites of protein binding faster than in background regions of the genome.
MACS is (for TF peaks) one of the most popular peak callers, it is also one of the oldest and this probably contributes to its success. It is a good method, good enough for many experimental conditions and requires very little justification if cited as the tool used in a publication. MACS performs removal of redundant reads, read-shifting to account for the offset in forward or reverse strand reads. It uses control samples and local statistics to minimize bias and calculates an empirical FDR.
[https://epigenie.com/guide-peak-calling-for-chip-seq/]
We specified the CHIP-seq file and control file. We will have only 5 outputs because one of the bams serve as a control. In our case it's CD4 lymphocytes. Defining the read length parameter is optional because MACS2 can detect read length automatically.
It will then remove the duplicate reads. IT odes so by calculating the maximum number of duplicate reads in a single position taking into account the sequencing depth and will remove any read which is in excess of the calculated sequencing depth. This is done to prevent any bias in the further caluclations.
We have an effective genome size is about 10 % because we only have the 15th chromosome. It's nothing to concern in our situation.
Our bandwidth is 300. In other words, a window will slide where the algorithm will try to find enriched regions which have M-fold sensitivity. The size of window is 2 times the bandwidth. The expected background is the number of reads times their length divided by the mappable genome size. Note that the mappable genome size is always less than the real genome size because of repetitive sequence.
The regions' fold enrichment must be higher than 10 and less than 30, but you can change these values if not enough regions are found. A smaller value for the lower cutoff provides more regions for model building, but it can also include spurious data into the model and thereby adversely affect the peak finding results. MACS2 uses 1000 enriched regions to model the distance d between the forward and reverse strand peaks.
In the actual peak detection phase, MACS2 extends the reads in the 3' direction to the fragment length obtained from modeling. If the model building failed or if it was switched off, the reads are extended to the value of the extension size parameter.
If a control sample is available (as in our case), MACS2 scales the samples linearly to the same read number. It then selects candidate peaks by scanning the genome again, now using a window size which is twice the fragment length.
MACS2 calculates a p-value for each peak using a dynamic Poisson distribution to capture local biases in read background levels. If there is a control sample, it calculates the local background. In case we are calling peaks with no control, we need to change the mode.
Finally, q-values are calculated using the Benjamini-Hochberg correction. We see that as model fold [5, 50] for tuning this parameter.
#########################################################################################
# Organise our data by folders
mkdir $PWD/main/macs2
mv $PWD/main/data/*q0.05* $PWD/main/macs2/
mkdir $PWD/main/macs2_broad
mv $PWD/main/data/*broad* $PWD/main/macs2_broad/
mkdir $PWD/main/bw
mv $PWD/main/data/*.bw $PWD/main/bw
###########
# Use this command (and modify) to get most confident peaks
###########
cat $PWD/main/macs2_broad/GSM1102785_CD14_H3K27me3_hg19.chr15_hg19_broad_0.1_peaks.broadPeak | sort -k9,9r
#########################################################################################
Whatever the selected mode, we can always get the most confident peaks by sorting according the quality and extracting the best peaks.
#########################################################################################
# Organise our data by folders
mkdir ~/macs2
mv $PWD/main/data/*q0.05* ~/macs2/
mkdir ~/macs2_broad
mv $PWD/main/data/*broad* ~/macs2_broad/
mkdir ~/bw
mb $PWD/main/data/*.bw ~/bw
#########################################################################################
Before we move further let's discuss the peak results a bit. Here is the tree structure of the main directory now.
We have 5 samples and 1 control file (.bw). Then 5 narrow peak Macs2 result, and 5 broad peak Macs2 result.
The analysis results consist of the following files:
macs2-peaks.tsv: List of peaks and their length, summit location and height, as well as their fold change, p- and q-value. Note that you can visualize this file in the JBR genome browser.
macs2-(narrow/broad)peak.bed: List of peak locations, q-values, fold change, p-value, q-value (again) and summit position relative to peak start, in narrow peak format (BED6+4).
macs2-summits.bed: List of peak summits and q-values in BED format.
macs2-log.txt: A log file listing the output from the various steps, which can be useful for diagnostic purposes and to get to know the details of the process.
macs2-model.pdf: If the peak model building is successful, a plot of the model is generated. The shape of the modeled peaks allows you to assess the quality of the model.
So far so good. But let's discuss one of these results in a bit detail.
The red peak is from Watson (forward) strand, while blue is from Crick (reverse) strand.
The two key features of MACS are: empirical modeling of 'd' (distance) and tag shifting by d/2 to putative protein-DNA interaction site; and the use of a dynamic λ (as in the lambda parameter of a distribution) local to capture local biases in the genome.
You can imagine this as fitting different curves (smoothening procedure for peak) by altering the width and lambda of the curve.
The effectiveness of the dynamic λ{local} is assessed by comparing MACS to a procedure that uses a uniform λ{background} from the genome background.
In other words background helps in maximising signal to noise ratio.
The spatial resolution, defined as the average distance from the peak summit to the nearest FKHR motif, are greatly improved by using tag shifting and the dynamic λ{local}.
In other words, if you do a tag shifting and tune the lambda, you can see two peaks joined together or different based.
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2592715/] section: model evaluation.
BUT GUESS WHAT!
All of this can be done by manually annotating peak (10 is enough!) and then do a Machine Learning model to do this fine tuning for us.
Here is why JBR browser is so good, according to segnor Oleg, all you need to do is load the tuned model into JBR and then manually tweak the parameters (e.g. FDR) to see an optimal model.
Sick!!
By the way, FDR is false discovery rate or Number of control peaks / Number of ChIP peaks. λlocal is critical for ChIP-Seq studies when matching control samples are not available. In such cases FDR can remain pretty low (say 4%) but for λBackGround it can go upto 50%.
"Macs2 is a Gold Standard when we have control. SPAN is a better choice when we have not a well annotated genome"-- according to Segnor Oleg. (correct me please)
Here we are going to do SPAN and Sicer peak calling as well. The pros and cons will be listed accordingly. But before moving on let's observe our results in JBR genome browser. You can download it from here:
https://research.jetbrains.org/groups/biolabs/tools/jbr-genome-browser
The instructions for installation is posted on the site.
Open JBR and select hg19 reference genome.
Select all the bigWig files for Samples and Control
Select all of the files from macs folder.
Select all of the files from macs broad folder.
Go to chromosome 15 (why?)
VOILA!
Let's make the results pretty by selecting colours for the tracks.
Not all ChIP-seq users are interested in the “peaky” data as seen with transcription factors. However nearly all peak callers were developed for exactly this kind of data. SCICER was developed for more diffuse chromatin modifications that can span kilobases or megabases of the genome. Their method scans the genome in windows and identifies clusters of spatial signals that are unlikely to appear by chance. These clusters or “islands” are used rather than fixed length windows, gaps in the islands are allowed to overcome technical issues (under-saturated experiments, repeat regions, etc). And this gap size can be adjusted for different types of chromatin modification. The program makes use of control data or a random background model.
In order to go further we need the hg19 version 30 annotation file (50.6 MB in size, this is for the interpretation step).
#########################################################################################
mkdir $PWD/data/genes
$PWD/data/genes/ wget 'ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/GRCh37_mapping/gencode.v30lift37.annotation.gtf.gz'
gunzip $PWD/data/genes/gencode.v30lift37.annotation.gtf.gz
cd $PWD
#########################################################################################
#########################################################################################
bash $PWD/main/scripts/sicer.sh $PWD/main/data/ hg19 $PWD/data/index/hg19.chrom.sizes 0.05
mkdir $PWD/main/sicer
ls $PWD/main/data/* | grep island | xargs -I {} mv {} $PWD/main/sicer
ls $PWD/main/data/* | grep sicer | xargs -I {} mv {} $PWD/main/sicer
# Command to find out the bed file with biggest number of lines
ls $PWD/main/sicer/*island.bed | xargs wc -l | grep -v total | sort -k1,1nr
# Sort BED file by chromosome and by start position
cat $PWD/main/sicer/GSM1102793_CD14_H3K4me1_hg19.chr15_hg19-W200-G600-FDR0.05-island.bed | sort -k1,1 -k2,2n > sorted.bed
#########################################################################################
$ ls $PWD/main/sicer/*island.bed | xargs wc -l | grep -v total | sort -k1,1nr
2742 /home/manu/Documents/chipseq_and_peak_calling_pipelines/main/sicer/GSM1102793_CD14_H3K4me1_hg19.chr15_hg19-W200-G600-FDR0.05-island.bed
2527 /home/manu/Documents/chipseq_and_peak_calling_pipelines/main/sicer/GSM1102785_CD14_H3K27me3_hg19.chr15_hg19-W200-G600-FDR0.05-island.bed
1776 /home/manu/Documents/chipseq_and_peak_calling_pipelines/main/sicer/GSM1102782_CD14_H3K27ac_hg19.chr15_hg19-W200-G600-FDR0.05-island.bed
1401 /home/manu/Documents/chipseq_and_peak_calling_pipelines/main/sicer/GSM1102788_CD14_H3K36me3_hg19.chr15_hg19-W200-G600-FDR0.05-island.bed
747 /home/manu/Documents/chipseq_and_peak_calling_pipelines/main/sicer/GSM1102797_CD14_H3K4me3_hg19.chr15_hg19-W200-G600-FDR0.05-island.bed
(base)
Let's visualise sorted.bed file in JBR browser with controls and samples.
+----------------------------------+
|SPAN Semi-supervised Peak Analyzer|
+----------------------------|/----+
, ,
__.-'|'-.__.-'|'-.__
='=====|========|====='=
~_^~-^~~_~^-^~-~~^_~^~^~^
SPAN is a tool for analyzing ChIP-seq / ATAC-seq data supporting ultra-low and single-cell input.
https://research.jetbrains.org/groups/biolabs/tools/span-peak-analyzer
We have the files for SPAN and Picard in /main/tools.
If you read the model fitting section, you would have read the following:
Model fitting
SPAN workflow consists of several steps:
1.) Convert raw reads to tags using user-supplied FRAGMENT parameter or maximum cross-correlation estimate. 2.) Compute coverage for all genome tiled into bins of BIN base pairs. 3.) Fit 3-state hidden Markov model that classifies bins as ZERO states with no coverage, LOW states of non-specific binding, and HIGH states of the specific binding. 4.) Compute posterior HIGH state probability of each bin. 5.) Trained model is saved into .span binary format. 6.) Peaks are computed using trained model and FDR and GAP parameters. 7.) If LABELS are provided, optimal parameters are computed to conform with them.
Model fitting mode produces trained model file in binary format as output, which can be:
1.) visualized directly in JBR Genome Browser used in integrated peak calling pipeline 2.) used in integrated peak calling pipeline
If you launch just: $ **bash $PWD/main/scripts/span.sh**
Batch SPAN
Need at least 4 parameters!
So our command has three additional parameters, which we don't really need to specify. We will do this tuning directly in JBR.
#########################################################################################
# Command to launch SPAN peak caller
bash $PWD/main/scripts/span.sh $PWD/main/tools/span-0.10.0.4787.jar $PWD/main/data/ hg19 $PWD/data/index/hg19.chrom.sizes
# Command to get all SPAN models
tree $PWD/main/data/ | grep "\.span"
# Move all the SPAN models to a single folder
mkdir $PWD/main/span_models
find $PWD/main/data/ -name "*.span" | xargs -I {} mv {} $PWD/main/span_models/
#########################################################################################
The basic idea is that once we have peaks, we can open JBR browser and manually annotate upto 10 peaks and train the model to identify peak. The end result is getting a model which we can 'autotune' in JBR for best settings.
If someone is facing troubles with the first command of SPAN (as I did too), the results are present in span models directory in $PWD/data.
mkdir $PWD/main/span_models
cp $PWD/data/span_models/* $PWD/main/span_models
$ls $PWD/main/span_models
GSM1102782_CD14_H3K27ac_hg19_chr15_hg19_GSM1102807_CD14_input_hg19_chr15_hg19_200#afb01.span GSM1102785_CD14_H3K27me3_hg19_chr15_hg19_GSM1102807_CD14_input_hg19_chr15_hg19_200#d839c.span GSM1102788_CD14_H3K36me3_hg19_chr15_hg19_GSM1102807_CD14_input_hg19_chr15_hg19_200#a2e11.span GSM1102793_CD14_H3K4me1_hg19_chr15_hg19_GSM1102807_CD14_input_hg19_chr15_hg19_200#523ef.span GSM1102797_CD14_H3K4me3_hg19_chr15_hg19_GSM1102807_CD14_input_hg19_chr15_hg19_200#9ce44.span
(base)
Let's load these models in JBR.
To exploit the complete power of SPAN, we can load model and manually annotate it. Here is the complete pipleline available: [https://artyomovlab.wustl.edu/aging/howto.html]
But I am going to give a basic idea of the steps anyways:
To do this we start to annotate some peaks (say 10) within JBR browser, by observing the bigWig peaks.
PLEASE NOTE THAT WE ARE DOING ANNOTATION OF PEAKS BASED ON BIGWIG FILES AND NOT SPAN MODELS.
The annotations on bigWig will then help to improve the models.
To add an annotation, first select region:
Move cursor into any track
Press and hold SHIFT key + click and hold left mouse button + move mouse
Release SHIFT key and mouse button
Then set the annotation type:
Click on one of four label buttons or press s/e/n/p key on keyboard
To clear highlighting press ESC
This step of the pipline inculdes annotating the results from peaks we get. In other words, whatever peaks we are observing with positions we try to superimpose them over the gene tracks.
We know already that:
▶ H3K27ac – distinguishes active enhancers from poised ▶ H3K27me3 – repression of transcription; has only one methyltransferase EZH2, which is a part of PRC2 ▶ H3K4me1 – enhancer mark ▶ H3K4me3 – promoters, active transcription ▶ H3K36me3 – gene body, active transcription
Let's begin!
#########################################################################################
# Convert GTF file to BED file, FILTER out the transcripts and SORTING the result and FILTER out all the non-chromosome positions and TAB separated. ONLY 15 chromosome is taken into account (this step takes a lot of time when all chromosomes are selected)
cat $PWD/data/genes/gencode.v30lift37.annotation.gtf | grep -v "#" | grep "^chr15" | awk -v OFS='\t' '($3=="gene") {print $1,$4-1,$5,$10}' | sort -k1,1 -k2,2n > $PWD/main/gencode.v30lift37.annotation.bed
#########################################################################################
Finally we have our .bed file annotated, now we can turn our attention to find the genes and superimpose the results from peak caller.
#########################################################################################
# Find the closest genes for the peak file
bedtools closest -a $PWD/main/macs2_broad/GSM1102797_CD14_H3K4me3_hg19.chr15_hg19_broad_0.1_peaks.broadPeak -b $PWD/main/gencode.v30lift37.annotation.bed -D ref | head -n 1
#########################################################################################
#########################################################################################
# Plot profile signal, which consists of 2 steps
# Step 1 - matrix computation
computeMatrix scale-regions -S $PWD/main/bw/GSM1102797_CD14_H3K4me3_hg19.chr15_hg19.bw -R $PWD/main/gencode.v30lift37.annotation.bed -a 3000 -b 3000 -out matrix
# Step2 - plot profile
plotProfile -m matrix -out ExampleProfile.png --plotTitle "Test data profile"
#########################################################################################
Now we are going to do some R analysis:
Please remember that half of the internet will be installed while trying to install ChIPseeker library (according to segnor Oleg). It took me about 30 minutes to get everything up and running. Just use BiocManager to install the packages.
#_________________________________________________________________________________________
install.packages('BiocManager')
# .libPaths("/tmp/RtmpYs45an/downloaded_packages")
library(BiocManager)
BiocManager::install('ChIPseeker')
BiocManager::install('TxDb.Hsapiens.UCSC.hg19.knownGene')
BiocManager::install('org.Hs.eg.db')
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
peaks <- readPeakFile("/home/manu/Documents/chipseq_and_peak_calling_pipelines/main/macs2_broad/GSM1102797_CD14_H3K4me3_hg19.chr15_hg19_broad_0.1_peaks.broadPeak")
peaks
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
peakAnno <- annotatePeak("/home/manu/Documents/chipseq_and_peak_calling_pipelines/main/macs2_broad/GSM1102797_CD14_H3K4me3_hg19.chr15_hg19_broad_0.1_peaks.broadPeak", tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db")
plotAnnoPie(peakAnno)
plotAnnoBar(peakAnno)
#_________________________________________________________________________________________
It gives a good information about the features present in your peak files.
Let's return to bash to do some more analysis. We want to prepare our files for gene-enrichment analysis in GREAT.
GREAT is a 'great' tool developed at harvard which does some cool stuff (Boris can explain), to give you all the genes and pathways observed in your data.
Another great enrichment tools is Mount Sinai MC's Enricher. In-fact Enricher encorporates more pathways than GREAT but GREAT has a 'great' feature up it's sleeve i.e. it can take a background file. This acts as kind of a filter and gives us more specific results.
The best thing to do is to combine the results from both of the enrichment analysis tools.
For GSEA (Gene Set Enrichment Analysis) in GREAT we need to reformat our BED file, sadly. Because GREAT is quite strict about the formats, we will do some clipping to our bed file. Additionally, we are going to select top and bottom 100 (total 200) most differentially expressed genes.
#########################################################################################
# Switch back to BASH
# Preparing my peaks file for uploading to GREAT, picking first 3 columns
cat $PWD/main/macs2/GSM1102797_CD14_H3K4me3_hg19.chr15_hg19_q0.05_peaks.narrowPeak | awk -v OFS='\t' '{print($1,$2,$3)}' > $PWD/main/h3k4me3.bed
cd $PWD/main/
# Pay attention to appending results to a file
head -n 100 h3k4me3.bed > h3k4me3_200.bed
tail -n 100 h3k4me3.bed >> h3k4me3_200.bed
#########################################################################################
http://great.stanford.edu/public/cgi-bin/greatWeb.php
Let's upload our h3k4me3.bed file and observe the results. For background we selected the whole genome.
As an exercise, you can upload the top_200 genes bed file, the result should seem surprising. When we were using all genes from Chr15 we were getting lesser pathways enriched, but when we used only 200 genes we get more pathways enriched (why?).
http://amp.pharm.mssm.edu/Enrichr/
Let's upload our h3k4me3.bed file and observe the results.
As a final step we are going to use Chip Atlas. It contains all the GEOsets reanalysed for MACS2. We are doing this to get our results in a bigger picrture. We want to find out if there is data out there to confirm our finding.
The intuition is that, different studies have the data but their question is different. Having our data compared to these studies greatly helps in hypothesis generation. We can find intresting literature, if the questions are similar to ours, we can also take the considerations from the experiments that have already been performed before us.
Our results include:
▶ Enrichment analysis against multiple ChIP-Seq experiments ▶ Matched peaks and genes
https://chip-atlas.org/enrichment_analysis
The most intresting thing about our result is that ChIP-Atlas was able to pinpoint the Cell Type! Monocytes-CD14+
That's all folks!
Please contribute or elaborate to some sections.